[SPARK-56253][PYTHON][CONNECT] Make spark.read.json accept DataFrame input #55097
Yicong-Huang wants to merge 16 commits into apache:master
Conversation
```scala
def ds: Dataset[String] = {
  val input = transformRelation(rel.getInput)
  val inputSchema = Dataset.ofRows(session, input).schema
  require(inputSchema.fields.length == 1,
```
Maybe we should throw InvalidInputErrors as the others do.

As we don't want to validate on the Python side, all errors will be thrown on the Scala side, so in classic we will have UNSUPPORTED_DESERIALIZER errors. For parity tests, it might be better to keep the Connect error as UNSUPPORTED_DESERIALIZER too, instead of InvalidInputErrors?

I used InvalidInputErrors in the end.
```python
def test_json_with_dataframe_input_non_string_column(self):
    int_df = self.spark.createDataFrame([(1,), (2,)], schema="value INT")
    with self.assertRaises(Exception):
```
Consider using assertRaisesRegex(Exception, "exactly one column|StringType") to at least verify the error message content.

Thanks! I added more regex checks to make the test tighter.
```scala
val input = transformRelation(rel.getInput)
val inputSchema = Dataset.ofRows(session, input).schema
if (inputSchema.fields.length != 1) {
  throw InvalidInputErrors.parseInputNotSingleColumn(inputSchema.fields.length)
}
if (inputSchema.fields.head.dataType != org.apache.spark.sql.types.StringType) {
  throw InvalidInputErrors.parseInputNotStringType(inputSchema.fields.head.dataType)
}
Dataset(session, input)(Encoders.STRING)
```
I added the checks here because otherwise INT can be implicitly cast to STRING.
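To illustrate the validation being discussed, here is a minimal plain-Python sketch of the same two checks (this is a hypothetical helper, not the actual Spark server code; the schema is modeled as a list of (column name, type name) pairs):

```python
# Hypothetical sketch of the schema validation above; not actual Spark code.
# A schema is modeled as a list of (column_name, type_name) pairs.
def validate_single_string_column(fields):
    if len(fields) != 1:
        raise ValueError(f"expected exactly one column, got {len(fields)}")
    name, data_type = fields[0]
    if data_type != "string":
        # Rejecting non-string types up front avoids an implicit cast
        # (e.g. INT silently cast to STRING).
        raise TypeError(f"column '{name}' must be StringType, got {data_type}")

validate_single_string_column([("value", "string")])  # passes
```

The point of checking the declared type, rather than letting the encoder handle it, is exactly the implicit-cast issue mentioned above.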
```scala
  reader
}
def ds: Dataset[String] = Dataset(session, transformRelation(rel.getInput))(Encoders.STRING)
def ds: Dataset[String] = {
```
Let's try to avoid creating the dataset twice. Analysis can be somewhat expensive. Do this instead:
```scala
val input = transformRelation(rel.getInput)
val df = Dataset.ofRows(session, input)
val inputSchema = df.schema
if (inputSchema.fields.length != 1) {
  throw InvalidInputErrors.parseInputNotSingleColumn(inputSchema.fields.length)
}
if (inputSchema.fields.head.dataType != org.apache.spark.sql.types.StringType) {
  throw InvalidInputErrors.parseInputNotStringType(inputSchema.fields.head.dataType)
}
df.as(Encoders.STRING)
```

```scala
def ds: Dataset[String] = {
  val input = transformRelation(rel.getInput)
  val inputSchema = Dataset.ofRows(session, input).schema
  if (inputSchema.fields.length != 1) {
```
I am a bit on the fence about this one. It is fine to have multiple columns, as long as the first one is a string.

Hmm, just for clarification, what would be the behavior with multiple columns? Should we just take the first column and ignore the rest?
```scala
  throw InvalidInputErrors.parseInputNotSingleColumn(inputSchema.fields.length)
}
if (inputSchema.fields.head.dataType != org.apache.spark.sql.types.StringType) {
  throw InvalidInputErrors.parseInputNotStringType(inputSchema.fields.head.dataType)
```
This is technically a behavior change.
```scala
if (fields.head.dataType != org.apache.spark.sql.types.StringType) {
  throw QueryCompilationErrors.parseInputNotStringTypeError(fields.head.dataType)
}
df.select(df.columns.head).as(Encoders.STRING)
```
You don't really have to add a projection here. df.as(Encoders.STRING) should work as well.

I tried removing the projection, but df.as(Encoders.STRING) on a multi-column DataFrame throws UNSUPPORTED_DESERIALIZER.FIELD_NUMBER_MISMATCH because the STRING encoder expects exactly one column. So the projection is needed to support multi-column DataFrames (using the first column). I'll keep it as-is.

Re: "the projection is needed to support multi-column DataFrames (using the first column)" — I think it should fail in this case?

I think @hvanhovell wants to support the multi-column case (#55097 (comment)), but I am not sure how we should support multi-column input. Currently it silently drops columns after the first one when receiving more than one column. I could also change it to raise an exception. Or do we want to somehow join the remaining columns back after we parse the JSON from the first column? @hvanhovell @zhengruifeng, what do you think?

I personally feel it should just fail, but if we want to support multiple columns by accepting the first column, we need to document that behavior. Also cc @cloud-fan and @HyukjinKwon, WDYT?

Silently dropping things is an anti-pattern; let's fail explicitly.

OK, let me change it to fail in that case. Then we will not be able to support multi-column input.
hvanhovell left a comment:

LGTM - one small nit.
Merged to master.
```python
)
result = self.spark.read.json(multi_df)
expected = [Row(name="Alice"), Row(name="Bob")]
self.assertEqual(sorted(result.collect(), key=lambda r: r.name), expected)
```
I think in this case it should fail?
…me input in spark.read.json

### What changes were proposed in this pull request?

Follow-up to #55097. Reject multi-column DataFrame input in `spark.read.json()` explicitly instead of silently using the first column and dropping the rest. Also renames error conditions and methods from `PARSE_INPUT_*` to `DATAFRAME_INPUT_*` since these are query compilation errors, not parse errors.

### Why are the changes needed?

Per review feedback on #55097 from cloud-fan and zhengruifeng: silently dropping columns is an anti-pattern. Multi-column DataFrame input should fail explicitly.

### Does this PR introduce _any_ user-facing change?

Yes. `spark.read.json(df)` now raises `DATAFRAME_INPUT_NOT_SINGLE_COLUMN` when the input DataFrame has more than one column (previously it silently used only the first column). Zero-column input now also raises `DATAFRAME_INPUT_NOT_SINGLE_COLUMN` instead of `PARSE_INPUT_NOT_STRING_TYPE`.

### How was this patch tested?

Updated existing tests in `test_datasources.py` (classic) and `test_connect_readwriter.py` (Connect) to verify that multi-column and zero-column input raises the expected error.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #55301 from Yicong-Huang/SPARK-56253-reject-multicol.

Authored-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…nput

### What changes were proposed in this pull request?

This PR adds support for passing a `DataFrame` containing CSV strings directly to `spark.read.csv()`, following the same pattern established by #55097 (SPARK-56253) for `spark.read.json()`.

### Why are the changes needed?

Adding DataFrame support to `csv()` makes the API consistent with `json()` and enables Connect-compatible CSV parsing without `sc.parallelize()`.

### Does this PR introduce _any_ user-facing change?

Yes. `spark.read.csv()` now accepts a `DataFrame` with a single string column as input, in addition to the existing `str`, `list`, and `RDD` inputs.

```python
csv_df = spark.createDataFrame([("Alice,25",), ("Bob,30",)], schema="value STRING")
spark.read.csv(csv_df, schema="name STRING, age INT").show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 25|
# |  Bob| 30|
# +-----+---+
```

### How was this patch tested?

Added 10 new test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #55274 from Yicong-Huang/SPARK-56255.

Authored-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?

Allow `spark.read.json()` to accept a DataFrame as input, in addition to file paths and RDDs. The first column of the input DataFrame must be of StringType; additional columns are ignored.

Why are the changes needed?

Parsing in-memory JSON text into a structured DataFrame currently requires `sc.parallelize()`, which is unavailable on Spark Connect. Accepting a DataFrame as input provides a Connect-compatible alternative. This is the inverse of `DataFrame.toJSON()`. Part of SPARK-55227.

Does this PR introduce any user-facing change?

Yes. `spark.read.json()` now accepts a DataFrame as input. The first column must be StringType; additional columns are ignored.

How was this patch tested?

New tests in `test_datasources.py` (classic) and `test_connect_readwriter.py` (Connect).

Was this patch authored or co-authored using generative AI tooling?

No
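Conceptually, the feature parses each row's string value as one JSON document, the inverse of `DataFrame.toJSON()`. The following plain-Python sketch (standard-library only, not Spark code; the variable names are illustrative) shows the per-line parsing this corresponds to:

```python
import json

# Each element stands in for one row of the single string column that
# spark.read.json(df) would consume; each line is one JSON document.
json_lines = ['{"name": "Alice", "age": 25}', '{"name": "Bob", "age": 30}']

# Parse every line into a dict, the way the reader infers Row objects.
rows = [json.loads(line) for line in json_lines]
print(rows[0]["name"])  # Alice
```

On Spark, the same data would come from a DataFrame with schema `value STRING`, which is exactly what `sc.parallelize()` previously had to produce and what Connect could not.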